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Abstract: The Square Kilometre Array and its pathfinders ASKAP and MeerKAT will produce 
prodigious amounts of data that necessitate automated source finding. The performance of automated 
source finders can be improved by pre-processing a dataset. In preparation for the WALLABY and 
DINGO surveys, we have used a test HI datacube constructed from actual Westerbork Telescope noise 
and WHISP HI galaxies to test the real world improvement of linear smoothing, the Duchamp source 
finder's wavelet de-noising, iterative median smoothing and mathematical morphology subtraction, on 
intensity threshold source finding of spectral line datasets. To compare these pre-processing methods 
we have generated completeness-reliability performance curves for each method and a range of input 
parameters. We find that iterative median smoothing produces the best source finding results for 
ASKAP HI spectral line observations, but wavelet de-noising is a safer pre-processing technique. 
In this paper we also present our implementations of iterative median smoothing and mathematical 
morphology subtraction. 
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1 Introduction 

Source finding reduces a dataset to a manageable ab- 
stract representation that is a collection of objects 
with physically meaningful properties. When a dataset 
becomes too large the dataset is virtually impossi- 
ble to work with directly, and the catalogue is the 
only method of data exploration. This is the case for 
the two Australian Square Kilometre Array Pathfinder 
(ASKAP) HI surveys, Wide-field ASK AP Legacy L- 
band All-sky Blind surveY (WALLABY) (|Koribalski et all 
(|2009T ): Koribalski, B., Staveley-Smith, L. et al., in 
preparation) and Deep Investigation of Neutral Gas 
Origins (DINGO). Individual ASKAP spectral line ob- 
servations will be at least 2,048 by 2,048 by 16,384 vox- 
els, which is 256GB (512GB) in float (double) preci- 
sion and only directly accessible using supercomputing 
facilities. 

The sheer size of the ASKAP spectral line obser- 
vations combined with the number of observations re- 
quired to carry out the WALLABY survey (~ 1200) 
necessitates automated source finding. An additional 
benefit is the reproducibility of automated source find- 
ers, which allows their performance to be incorporated 
into existing and future simulations of the WALLABY 
survey. The WALLABY team has been investigating 
existing source finding techniques as well as develop- 
ing novel methods. The essential metrics for assessing 
automated source finding are reliability and complete- 
ness. Completeness is the fraction of sources that it 
recovers, and reliability is the fraction of detections 
that are actual sources. All automated source finders 
can be characterised by a 'performance curve', which 
describes the combination of reliability and complete- 



ness that a source finder achieves on a given dataset. 

It is a common practise to pre-process a dataset 
before applying a source finding method. The goal 
of pre-processing is to improve both the completeness 
and reliability of the source finder. This is achieved 
by 'correcting' the dataset. In a 'corrected' dataset 
the noise behaves as your source finder assumes, the 
dataset is free from background structure and sources 
have maximised signal-to-noise ratios. 

It should be noted that the term 'signal-to- noise ra- 
tio' in this context does not account for the Jy/beam 
units of radio observations. Technically a radio obser- 
vation should be re-scaled for the new beam size when 
an observation is smoothed. This involves reversing 
the initial beam scaling, which in some circumstances 
increases the noise level more than it is minimised by 
smoothing. For the purposes of pre-processing and 
source finding though units are irrelevant. The signal- 
to-noise ratios discussed here therefore refer to the 
unsealed signal-to-noise ratios of radio observations, 
which are always enhanced by smoothing. 

In this paper we will compare four pre-processing 
methods for ASKAP HI datacubes: iterative median 
smoothing, mathematical morphology subtraction, wavelet 
de-noising and linear smoothing. We will compare 
these pre-processing methods by examining the effect 
they have on the performance curve of a simple inten- 
sity threshold source finder. We analyse the effect on 
an intensity threshold performance curve, because in- 
tensity thresholding is a t the core of most source find- 
ers eg. SExtractor (iBertin fc Arnoutsl H996) and 
SFlND (jHopkins et alj|2002i ). 

We are including linear smoothing and wavelet de- 
noising in our comparison, because they are among the 
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most commonly used pre-processing methods. These 
two methods take contrasting approaches. Linear smooth 
ing uses averaging or convolution to re-distribute the 
flux within the datacube so that noise fluctuations 
are reduced more than source signal, which results in 
increased source signal-to-noise ratios. Wavelet de- 
noising however tries to directly subtract noise from 
the datacube. A wavelet transform decomposes a dat- 
acube into signal on different scales at all positions 
within the datacube. Signal on scales smaller and 
larger than the expected size of sources can then be re- 
moved. Alternatively, the noise level on different scales 
can be measured from the wavelet transform, and only 
'significant' signal (as defined in some way by a user) 
on eac h scale is retained. We use the Duchamp source 
finder (jWhitind 120081 . 12012ft to implement both, be- 
cause Duchamp is not only a commonly used state-of- 
the-art source finder, but it is also the default ASKAP 
source finder. 

We are including iterative median smooth i ng in our 
comparison, because lArias-Castro fc Donohol l|2009T ) has 
shown that iterative median smoothing produces a larger 
gain in source signal-to-noise ratio than linear smooth- 
ing methods. The key is that calculating a median 
is a non- linear process that preserves source 'edges'. 
Source edges are preserved because median ca lcula- 
tions a re insensitive to sample outliers. Crucially. I Arias-Castro fc Donohol 
(I2009T ) found that only two iterations are required, so 
long as the first iteration uses the smallest smooth- 
ing kernel possible. This minimal number of iterations 
results in a reasonable computational load even when 
large smoothing kernels are used for the second itera- 
tion. 

We chose to test mathematical morphology sub- 
traction, because it is a proven technique for size fil- 
tering images. We can use mathematical morphology 
to filter out the small-scale information in a dataset 
to id entify large scale structure in the image l|Rudnickl 
2002). Subtracting this large scale structure can po- 
tentially improve reliability by re-normalising the dataset 
noise properties, so that the mean of the noise distri- 
bution is constant throughout the dataset. 

There are distinct disadvantages common to all of 
these pre-processing methods. First, a poor choice of 
smoothing kernel can actually decrease source signal- 
to-noise ratios when using linear smoothing, iterative 
median smoothing and mathematical morphology sub- 
traction. This is dealt with by using multiple smooth- 
ing kernels. This increases the computational load 
though, and the results of the multiple smoothing ker- 
nels need to be combined intelligently. Second, all 
of these pre-processing methods need to ac count for 



wavelet transform kernel from matching an astronom- 
ical source, which is either a positive feature (emission) 
or a negative feature (absorption). Consequently, any 
astronomical source will necessarily exist on multiple 
scales (and probably multiple locations). Depending 
on how the wavelet transform information is filtered 
to de-noise a datacube, this can have a negative effect 
on source finder performance. 

We will carry out our comparison of these source 
finder pre-processing methods usi ng the Westerbor k 
Telescope (WSRT) test datacube in lSerra et aD (|2Q12h . 
The WSRT test datacube was crea ted by injecting 
WHISP sources (jvan der Hulst 200^) into a datacube 
of real WSRT spectral datacube noise. This test dat- 
acube not only contains real sources embedded in real 
noise, but the resolution (both angular and spectral) 
and noise level closely matches that expected of the 
APERT1F and ASKAP telescopes. In particular, the 
30" Gaussian beam, 10" pixels, 3.86 km s _1 channels 
and 1.86 mjy /beam noise level of this test datacube, is 
designed to match WALLABY observations. This al- 
lows us to test the 'real world' performance of the var- 
ious pre-processing methods. We illustrate the scale 
of the WHISP sources and the test datacube using a 
channel map in Figure [T] 



Frequency: 1393.555 MHz 




Figure 1: This is a channel map of the WSRT 
test datacube used in this paper. The source in 
the centre of this channel map is one of the most 



sp 

datase ts having different types of dimensions. iFloer fc Winkell 
(|2012ft is a good example of a wavelet transform for HI 
datacubes that accounts for the difference between the 
RA, Dec angular dimensions and the frequency dimen- 
sion. 

There are two additional disadvantages of wavelet 
de-noising. Computing a wavelet transform can be 
much more computationally expensive than the other 
pre-processing methods. Additionally, a requirement 
of all wavelet transform kernels is that the integral 
of any kernel is zero. This prevents any individual 



spatially resolved sources. 



The rest of this paper is organised in the follow- 
ing way. We begin by presenting the implementation 
of iterative median smoothing and mathematical mor- 
phology that we used for our comparison in Section [2] 
Next we compare and analyse the performance impact 
of the various source finder pre-processing methods in 
Section [3] Then we finish in Section [4] with our con- 
clusions and recommendations. 
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2 Pre-processing implementa- 
tions 

2.1 Iterative median smoothing 

Iterative m edian smoothing is reviewed an d analysed 
in depth in lArias-Castro fc Donohd (|2009h . Here we 
present a brief overview of iterative median smoothing 
and our implementation. Iterative median smooth- 
ing is the process of repeatedly replacing each ele- 
ment of a dataset with the median of a region cen- 
tered on the element, using progre ssively larger re- 
gions. lArias-Castro fc Donohd lj2009T ) found that with 
the right choice of region size, only two iterations are 
needed to obtain near maximal performance from it- 
erative median smoothing. To do so, the first itera- 
tion needs to measure the median of the smallest re- 
gion possible, and the second iteration needs to mea- 
sure the median of a region matching the size and 
shape of the signal being optimised. The first pass re- 
moves elements that are outliers and the second pass 
re-distributes the flux, while preserving source edges, 
to improve source signal-to-noise ratio by averaging 
noise. 

In this paper we use a two-pass implementation of 
iterative median smoothing. The first pass uses only 
the element being processed and the six neighbouring 
voxels that share a face with it. This is a 3-D extension 
of 'four-connected' pixels. We chose a 3-D version of 
'four-connected' pixels for the first iteration, because 
this matches the pixel size of the beam's central com- 
ponent. This is sufficiently large to filter out individual 
noisy voxels in the presence of beam convolution (con- 
firmed by us visually). The second pass uses either 
a rectangular parallelepiped or an ellipsoid extending 
along the frequency axis as a smoothing element. The 
ellipsoid (rectangular parallelepiped) is defined using 
separate radii (lengths) for the frequency axis and an- 
gular axes, RA and Dec. 

We have developed software that efficiently applies 
two-pass iterative median smoothing using an initial 
six-connected voxel element followed by a n-channel 
rectangular parallelepiped or ellipsoid element. This 
software deals with large datacubes using a two-pronged 
approach. First, the software uses a 'buffer-and-shuflie' 
approach to minimise the memory overheads associ- 
ated with multiply smoothing the input datacube. Sec- 
ond, sufficiently large datacubes are broken up into 
manageable 'chunks', and processed sequentially. The 
use of a buffer-and-shuffle approach allows processing 
of files as large as three gigabytes on a 32-bit laptop 
before efficiency requires segmentation of a datacube. 

2.2 Mathematical morphology sub- 
traction 

Mathematical morphology is a technique for analysing 
the morphology of objects in images. The core of 
mathematical morphology is erosion and dilation with 
a kernel. These two non-linear operations can be com- 
bined in multiple ways, but the simplest combinations 
are erosion followed by dilation to 'open' an object 



and 'closing' an object by dilating then eroding. The 
easiest way to think of the open and close operations 
is the effect that they have on text. The open op- 
eration sharpens the characters by filtering out small 
scale structure. A consequence of the open operation 
is that it 'rounds' the remaining structure. The close 
operation by contrast blurs the characters. It ampli- 
fies small scale structure using the large scale structure 
as a guide. Unfortunately, sufficiently close characters 
will be merged into each other. 

Monochromatic images, such as HI datacubes, are 
processed using 'structuring element' kernels. Using 
a structuring element dilation is achieved by replac- 
ing an element with the maximum value in the region 
around it (specified by the kernel). Similarly, erosion 
is achieved by replacing an element with the minimum 
of the surrounding region. 

In this pap er we use the approach developed in 
Rudnick (2002). We use an open operation to filter the 
small scale structure out of the image and obtain an 
open image of the large scale structure. By subtract- 
ing this large scale structure from the original image, 
we can obtain a residual imag e of the small sc ale struc- 
ture. We use the approach in lRudnickl (|2002l ) because 
combining the open and residual images preserves the 
flux of the original image. 

3 Analysis 

We compare the effectiveness of various source finder 
pre-processing methods by constructing completeness- 
reliability performance curves for a simple intensity 
threshold source finder, after applying the various pre- 
processing methods. We have chosen to compare the 
performance curves instead of completeness and relia- 
bility for a given threshold, because each pre-processing 
method will alter the datacube and its noise distribu- 
tion in different ways. An arbitrary threshold (in units 
of noise level) will therefore not be consistent across 
the outputs of the various pre-processing methods. 

The simple intensity threshold source finder that 
we use here is a two step process. In the first step a 
CH — h code robustly measures the datacube's standard 
deviation from the interquartile range, and then se- 
lects all voxels greater than a user specified multiple of 
the standard deviation as source voxels. These flagged 
voxels are then combined into objects, merged, size 
filtered and turned into a catalo gue u s ing th e object 
generating library presented in lJurekl (|2012l ). Every 
catalogue used merging lengths of 1, 1 and 3 empty 
voxels along the RA, Dec and frequency axes. We size 
filtered every catalogue to only include object's con- 
taining 14 voxels, occupying 5 lines of sight and whose 
extent in the RA, Dec and frequency dimensions is at 
least 3 voxels. We chose these merging and size filter- 
ing parameters, because they are representative of the 
values that would be chosen by a user when trying to 
maximise completeness. 

Th e object generating library presented in lJurekl 
(|2012l ) can generate catalogues that include a sparse 
representation of each object's voxel mask. Us i ng th e 
same custom C++ software as iPopping et all ((2012), 
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we use these sparse representations to match the ob- 
ject's in the various output catalogues to the input cat- 
alogue on a voxel-to-voxel basis. From these matches 
we calculate the source finder performance metrics. 

The completeness-reliability performance curves that 
we measured for the various pre-processing methods 
are plotted in Figure We measured performance 
curves for all of the sources and for a subset of 'de- 
tectable' objects, which are defined as having peak 
signal-to-noise ratios greater than or equal to three. 
We refer to these two types of performance curves 
as the total and detectable performance curves. The 
input parameters we used with these pre-processing 
methods are listed in Table [1] in the Appendix. To 
obtain a meaningful comparison of the pre-processing 
methods, we used similar parameter values across the 
pre-processing methods. The choice of mathematical 
morphology opening subtraction kernels ranges from 
the the beam size to 20 times the beam size. This en- 
sures that our choice of opening kernels brackets the 
optimal size of th r ee tim es the source size, that was 
found bv lRudnickl (|2002l 1. 

Figure [2] shows that all of the pre-processing meth- 
ods, except the mathematical morphology opening sub- 
traction, produces a performance curve better than the 
default performance curve for the right choice of pre- 
processing parameters. The best performance curves 
are obtained by using iterative median smoothing or 
linear smoothing. Every method except for the linear 
smoothing can prove detrimental however if incorrect 
parameters are chosen. The iterative median smooth- 
ing has the most detrimental effect when a 5,5 kernel is 
used. This is as expected, because the kernel is larger 
than some source components. This is consistent with 
our observation that iterative median smoothing is al- 
ways an improvement when the kernel's spatial extent 
matches the beam size. Conversely, as the mathemati- 
cal morphology subtraction is more detrimental as the 
kernel size shrinks. We conjecture that the mathe- 
matical morphology subtraction performance curve is 
asymptoting towards the reference performance curve 
as the kernel approaches the size of the test datacube. 

In Figure |31 we have plotted a subset of the perfor- 
mance curves in Figure[2l which reflect the best perfor- 
mance curve generated by each pre-processing method. 
The best results are achieved for: linear smoothing 
with an 11 channel Hanning filter in frequency; math- 
ematical morphology opening subtraction with a sin- 
gle channel rectangle whose side length is 61 voxels; 
iterative median smoothing with a 3 by 3 by 11 rect- 
angular parallelepiped and wavelet reconstruction in 
either 1 or 3 dimensions with a 4-a threshold, without 
applying a minimum scale threshold. In the rest of our 
analysis, we will only use the wavelet reconstruction in 
one dimension to avoid redundancy. The parameters 
for these performance curves are highlighted in bold 
italics in Table [T] 

From the performance curves in Figure [3] we can 
draw four interesting conclusions. First, a datacube 
needs to contain sufficient large scale structure for math- 
ematical morphology opening subtraction to be worth- 
while, and our real world test datacube does not. Sec- 
ond, for our choice of parameters, the linear smoothing 
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Figure 2: The total (top) and detectable (bot- 
tom) performance curves measured for every pre- 
processing method and choice of input parameters. 
The solid line is the reference performance curve 
measured from the unprocessed WSRT test dat- 
acube. The red X, blue crosses, green triangles 
and hollow circles mark the performance curves 
resulting from mathematical morphology subtrac- 
tion, wavelet de-noising, linear smoothing and it- 
erative median smoothing, respectively. 



and iterative median smoothing produce better perfor- 
mance curves than Duchamp's 3-D wavelet de-noising. 
Third, the detectable performance curves show greater 
improvement than the total performance curves, be- 
cause pre-processing is more beneficial for sources that 
are easier to detect. Our final conclusion is that smooth- 
ing along the frequency axis contributes more to im- 
provement in the performance curves than smoothing 
spatially. Visual inspection of our test datacube re- 
veals that most of our sources are marginally resolved 
spatially, but well resolved spectrally. This means 
that a filter 'matching' our sources distribution is pro- 
viding the biggest improvement to the performance 
curves, which is what we would expect in the presence 
of Gaussian noise. We can therefore conclude that our 
real world test datacube's noise is sufficiently Gaus- 
sian that matched filtering is the optimal method for 
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Figure 3: A subset of the total (top) and de- 
tectable (bottom) performance curves in Figure 
[5] (symbols have the same meaning). These per- 
formance curves are the best performance curves 
obtained with each pre-processing method. 



source extraction. 

We will further analyse the subset of pre-processing 
results in Figure by comparing the effects of the 
different pre-processing methods on the fragmentation 
(fraction of multiply detected sources) and merging 
(fraction of sources detected as a single object) rates as 
well as the number of sources contributing to the max- 
imally merged object. The fragmentation and merging 
rates and maximally merged object are plotted in Fig- 
ure [4] as a function of completeness. 

Figure 0] reveals that at the highest completeness 
values, the completeness comes at the expense of not 
only reduced reliability, but also an increasing merg- 
ing rate. Eventually the merging becomes so bad that 
most of the sources are merged into a single object. 
This problem affects the linear smoothing the most, 
but it is no worse than the merging affecting the unpro- 
cessed datacube. The iterative median pre-processing 
produces the least merged source finding results at the 
highest completeness values. 

The wavelet de-noising produces the best (lowest) 
fragmentation rates at all completeness levels, but only 
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Figure 4: The merging rates, fragmentation rates 
and number of sources comprising maximally 
merged object for the total (top) and detectable 
(bottom) performance curve subset of Figure [31 
plotted against completeness. Symbols have the 
same meaning as Figures [5] and [3] 
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marginally better than the linear smoothing and iter- 
ative median smoothing. Note that we have excluded 
the fragmentation rates at the highest completeness 
levels in our assessment, because the fragmentation 
rates decline due to increasing merging rates. The 
increase in fragmentation rates at low completeness 
levels (seen in the mathematical morphology subtrac- 
tion) is expected when using an intensity threshold 
source finder. When using increasingly higher thresh- 
olds, only the peaks of a source are 'found', which re- 
sults in fragmentation. This fragmentation effect at 
low completeness levels is solved through the use of a 
'growth' threshold in addition to the threshold used to 
define source voxels. 

4 Conclusion 

We ha ve used the WSRT test datacube of ISerra et al.l 
(|2012h to test the real world performance improve- 
ment of linear smoothing, Duchamp's 3-D wavelet de- 
noising, iterative median smoothing and mathematical 
morphology subtraction, when using intensity thresh- 
olding to find sources in ASKAP HI spectral line obser- 
vations. We generated completeness-reliability perfor- 
mance curves for each pre-processing method and an 
unprocessed datacube, which we used as a reference, 
to investigate the effect of each pre-processing method 
for a range of input parameters. 

We found that the iterative median smoothing and 
linear smoothing produce the greatest improvement in 
source finder performance. We recommend that iter- 
ative median smoothing be used over linear smooth- 
ing though, because iterative median smoothing is less 
affected by merging and fragmentation. In our tests 
however the effect of iterative median smoothing on 
source finder performance proved to be more highly 
dependent upon the pre-processing parameters than 
the other methods. The performance improvement of- 
fered by Duchamp's 3-D wavelet de-noising was the 
least sensitive to the choice of input parameters. It is 
the safest pre-processing method. It should be noted 
however that using a smoothing kernel with a spatial 
extent smaller than or equal to the datacube's beam 
size, will in general improve source finder performance. 

We think the Gaussian nature of the WSRT noise is 
the reason that the linear smoothing and iterative me- 
dian smoothing produce the greatest source finder per- 
formance improvement, because it approximates matched 
filtering. For this reason, we do not expect our re- 
sults to be applicable to images or datacubes with non- 
Gaussian noise. We do however think that the edge- 
preserving nature of the iterative median transform is 
the reason that it does not suffer from fragmentation 
and merging as badly as linear smoothing, and that 
this result is applicable to all images and datacubes. 
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Table 1: Performance curves were measured for 
the pre-processing methods using the combina- 
tions of input parameters in this table. Every com- 
bination of parameters in a given row has been 
used. All kernel sizes are given in voxels. The 
parameter combinations that produced the best 
performance curve for each pre-processing method 
(plotted in Figure [3]), are highlighted in bold ital- 
ics. 

Linear smoothing 



smoothing axis kernel kernel size 



frequency harming 


3, 7, 11 


RA & Dec Gaussian 


3, 5 


Wavelet de-noising 


dimensionality min. scale kept 


threshold 


(voxels) 


(sigma) 


1,3 J, 3 


2, A 


Iterative median smoothing 


kernel angular size 


freq. size 


rect. parallel. 3, 5, 7 


3, 5, 7, 11 


11 


11 


ellipsoid 3, 5, 7 


3, 5, 7, 11 


11 


11 


Mathematical morphology subtraction 


kernel angular size 


freq. size 


rect. parallel. 3 


1, 3 


5 


1, 5 


11 


1, 11 


21 


1, 21 


41, 61 


1 



